The Z-loss: a shift and scale invariant classification loss belonging to the Spherical Family

Authors

  • Alexandre de Brébisson
  • Pascal Vincent
Abstract

Despite being the standard loss function for training multi-class neural networks, the log-softmax has two potential limitations. First, it involves computations that scale linearly with the number of output classes, which can restrict the size of problems that we are able to tackle with current hardware. Second, it remains unclear how closely it matches the task loss, such as the top-k error rate or other non-differentiable evaluation metrics which we ultimately aim to optimize. In this paper, we introduce an alternative classification loss function, the Z-loss, which is designed to address these two issues. Unlike the log-softmax, it has the desirable property of belonging to the spherical loss family (Vincent et al., 2015), a class of loss functions for which training can be performed very efficiently, with a complexity independent of the number of output classes. We show experimentally that it significantly outperforms the other spherical loss functions previously published and investigated. Furthermore, we show on a word language modeling task that it also outperforms the log-softmax with respect to certain ranking scores, such as top-k scores, suggesting that the Z-loss has the flexibility to better match the task loss. These qualities thus make the Z-loss an appealing candidate for training very large output networks efficiently, such as word language models or other extreme classification problems. On the One Billion Word (Chelba et al., 2014) dataset, we are able to train a model with the Z-loss 40 times faster than with the log-softmax and more than 4 times faster than with the hierarchical softmax.
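The abstract's first point, that log-softmax computations scale linearly with the number of output classes, comes from the normalizer: the log-sum-exp in the negative log-likelihood runs over every class. The sketch below illustrates only this cost structure; it is not the Z-loss itself (whose formula is not reproduced in the abstract), and the vocabulary size `D` and target index are illustrative assumptions.

```python
import numpy as np

def log_softmax_nll(logits, target):
    """Negative log-likelihood under the log-softmax.

    The log-sum-exp normalizer below touches ALL D classes, so the
    cost of one loss (and gradient) evaluation grows linearly with D.
    Spherical-family losses avoid this dependence on D during training.
    """
    m = logits.max()                               # max-shift for numerical stability
    log_z = m + np.log(np.exp(logits - m).sum())   # O(D) normalizer
    return log_z - logits[target]

rng = np.random.default_rng(0)
D = 100_000                      # e.g. a word-language-model vocabulary (assumed size)
logits = rng.normal(size=D)      # stand-in network outputs
loss = log_softmax_nll(logits, target=42)
```

Since `log_z` upper-bounds every logit, the returned loss is always non-negative; it is zero only when the target class receives essentially all of the probability mass.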


Related articles

Truncated Linear Minimax Estimator of a Power of the Scale Parameter in a Lower-Bounded Parameter Space

Minimax estimation problems with a restricted parameter space have attracted increasing interest over the last two decades. Some authors have derived minimax and admissible estimators of bounded parameters under squared error loss and scale-invariant squared error loss. In some truncated estimation problems, the most natural estimator to consider is the truncated version of a classic...


Estimation of Scale Parameter Under a Bounded Loss Function

The quadratic loss function has been used by decision-theoretic statisticians and economists for many years. In this paper, the estimation of a scale parameter under a bounded loss function, which is adequate for assessing quality and quality improvement, is considered with restriction to the principles of invariance and risk unbiasedness. An implicit form of minimum risk scale equivariant ...


ESTIMATION OF SCALE PARAMETER UNDER A REFLECTED GAMMA LOSS FUNCTION

In this paper, the estimation of a scale parameter t under a new and bounded loss function, based on a reflection of the gamma density function, is discussed. The best scale-invariant estimator of t is obtained, and the admissibility of all linear functions of the sufficient statistic, for estimating t in the absence of a nuisance parameter, is investigated.


Bayesian Estimation of Shift Point in Shape Parameter of Inverse Gaussian Distribution Under Different Loss Functions

In this paper, a Bayesian approach is proposed for shift point detection in an inverse Gaussian distribution. In this study, the mean parameter of the inverse Gaussian distribution is assumed to be constant and shift points in the shape parameter are considered. First, the posterior distribution of the shape parameter is obtained. Then the Bayes estimators are derived under a class of priors and using variou...


On Estimation Following Selection with Applications on k-Records and Censored Data

Let X1 and X2 be two independent random variables from gamma populations π1 and π2 with means αθ1 and αθ2 respectively, where α (> 0) is the common known shape parameter and θ1 and θ2 are scale parameters. Let X(1) ≤ X(2) denote the order statistics of X1 and X2. Suppose that the population corresponding to the largest X(2) (or the smallest X(1)) observation is selected. The problem of in...



Journal:
  • CoRR

Volume: abs/1604.08859

Publication date: 2016